14 research outputs found
Vermeidung von Repräsentationsheterogenitäten in realweltlichen Wissensgraphen (Avoiding Representation Heterogeneities in Real-World Knowledge Graphs)
Knowledge graphs are repositories providing factual knowledge about entities. They are a great source of knowledge to support modern AI applications for Web search, question answering, digital assistants, and online shopping. Advances in machine learning techniques and the Web's growth have led to colossal knowledge graphs with billions of facts about hundreds of millions of entities, collected from a large variety of sources. While integrating independent knowledge sources promises rich information, it inherently leads to heterogeneities in representation due to the large variety of different conceptualizations. Real-world knowledge graphs are thus threatened in their overall utility. Due to their sheer size, they can hardly be curated manually anymore; automatic and semi-automatic methods are needed to cope with these vast knowledge repositories. We first address the general topic of representation heterogeneity by surveying the problem throughout various data-intensive fields: databases, ontologies, and knowledge graphs. Different techniques for automatically resolving heterogeneity issues are presented and discussed, and several open problems are identified. Next, we focus on entity heterogeneity. We show that automatic matching techniques may run into quality problems when working in a multi-knowledge-graph scenario due to incorrect transitive identity links, and we present four techniques that significantly improve the quality of arbitrary entity matching tools. Concerning relation heterogeneity, we show that synonymous relations in knowledge graphs pose several difficulties in querying. We therefore resolve these heterogeneities with knowledge graph embeddings and with Horn rule mining; both methods detect synonymous relations in knowledge graphs with high quality. Furthermore, we present a novel technique for avoiding heterogeneity issues at query time using implicit knowledge storage: we show that large neural language models are a valuable source of knowledge that can be queried similarly to knowledge graphs, already resolving several heterogeneity issues internally.
Evaluating the Knowledge Base Completion Potential of GPT
Structured knowledge bases (KBs) are an asset for search engines and other
applications, but are inevitably incomplete. Language models (LMs) have been
proposed for unsupervised knowledge base completion (KBC), yet, their ability
to do this at scale and with high accuracy remains an open question. Prior
experimental studies mostly fall short because they only evaluate on popular
subjects, or sample already existing facts from KBs. In this work, we perform a
careful evaluation of GPT's potential to complete the largest public KB:
Wikidata. We find that, despite their size and capabilities, models like GPT-3,
ChatGPT and GPT-4 do not achieve fully convincing results on this task.
Nonetheless, they provide solid improvements over earlier approaches with
smaller LMs. In particular, we show that, with proper thresholding, GPT-3
makes it possible to extend Wikidata by 27M facts at 90% precision.
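The thresholding idea can be sketched as follows: score each candidate fact with the model's confidence, then choose a cutoff on labelled validation data such that the retained facts meet a target precision. The function and the toy data below are illustrative, not taken from the paper.

```python
def choose_threshold(scored_facts, target_precision):
    """Return the lowest confidence cutoff whose retained facts still
    meet the target precision on labelled validation data.

    scored_facts: list of (confidence, is_correct) pairs.
    """
    ranked = sorted(scored_facts, key=lambda p: p[0], reverse=True)
    best, correct = None, 0
    for i, (conf, ok) in enumerate(ranked, start=1):
        correct += ok
        if correct / i >= target_precision:
            # Accepting every fact down to this confidence meets the target.
            best = conf
    return best

# Illustrative validation data: (model confidence, human judgement).
val = [(0.95, True), (0.9, True), (0.8, True),
       (0.7, False), (0.6, True), (0.4, False)]
cutoff = choose_threshold(val, target_precision=0.9)  # 0.8 on this toy data
```

At deployment, only candidate facts scoring at or above the chosen cutoff would be added to the KB.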
Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation
Prompt Tuning is emerging as a scalable and cost-effective method to
fine-tune Pretrained Language Models (PLMs), which are often referred to as
Large Language Models (LLMs). This study benchmarks the performance and
computational efficiency of Prompt Tuning and baselines for multi-label text
classification. This is applied to the challenging task of classifying
companies into an investment firm's proprietary industry taxonomy, supporting
their thematic investment strategy. Text-to-text classification is frequently
reported to outperform task-specific classification heads, but has several
limitations when applied to a multi-label classification problem where each
label consists of multiple tokens: (a) Generated labels may not match any label
in the label taxonomy; (b) The fine-tuning process lacks permutation invariance
and is sensitive to the order of the provided labels; (c) The model provides
binary decisions rather than appropriate confidence scores. Limitation (a) is
addressed by applying constrained decoding using Trie Search, which slightly
improves classification performance. All limitations (a), (b), and (c) are
addressed by replacing the PLM's language head with a classification head,
which is referred to as Prompt Tuned Embedding Classification (PTEC). This
improves performance significantly, while also reducing computational costs
during inference. In our industrial application, the training data is skewed
towards well-known companies. We confirm that the model's performance is
consistent across both well-known and less-known companies. Our overall results
indicate the continuing need to adapt state-of-the-art methods to
domain-specific tasks, even in the era of PLMs with strong generalization
abilities. We release our codebase and a benchmarking dataset at
https://github.com/EQTPartners/PTEC
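The trie-constrained decoding used against limitation (a) can be sketched with a minimal trie over tokenised taxonomy labels; during generation, only tokens that keep the output a prefix of some valid label would be allowed. This is a sketch over whitespace tokens (a real implementation would operate on the PLM's subword tokens), and the class and label names are illustrative.

```python
class LabelTrie:
    """Minimal trie over tokenised taxonomy labels. Constrained decoding
    would mask the PLM's logits so that only tokens returned by
    allowed_next() can be generated."""

    def __init__(self, labels):
        self.root = {}
        for label in labels:
            node = self.root
            for tok in label.split():  # whitespace stands in for subword tokens
                node = node.setdefault(tok, {})
            node["<end>"] = {}  # marks a complete taxonomy label

    def allowed_next(self, prefix):
        """Tokens that may follow the given token prefix."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return set()
            node = node[tok]
        return set(node)

# Hypothetical industry taxonomy.
taxonomy = ["renewable energy", "renewable materials", "health tech"]
trie = LabelTrie(taxonomy)
trie.allowed_next(["renewable"])  # {"energy", "materials"}
```

Because every decoding path ends at an `<end>` marker, the generated label is guaranteed to match an entry of the taxonomy.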
Large Language Models and Knowledge Graphs: Opportunities and Challenges
Large Language Models (LLMs) have taken Knowledge Representation -- and the
world -- by storm. This inflection point marks a shift from explicit knowledge
representation to a renewed focus on the hybrid representation of both explicit
knowledge and parametric knowledge. In this position paper, we will discuss
some of the common debate points within the community on LLMs (parametric
knowledge) and Knowledge Graphs (explicit knowledge) and speculate on
opportunities and visions that the renewed focus brings, as well as related
research topics and challenges.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License.
embeddings.zip
These are the resources for our research paper "Do Embeddings Actually Capture Knowledge Graph Semantics?", published in the research papers track of the ESWC 2021 conference.
Semantic Query Processing: Estimating Relational Purity
The use of semantic information found in structured knowledge bases has become an integral part of the processing pipeline of modern intelligent information systems. However, such semantic information is frequently insufficient to capture the rich semantics demanded by the applications, and thus corpus-based methods employing natural language processing techniques are often used conjointly to provide additional information. However, the semantic expressiveness and interaction of these data sources with respect to query processing result quality is often not clear. Therefore, in this paper, we introduce the notion of relational purity, which represents how well the explicitly modelled relationships between two entities in a structured knowledge base capture the implicit (and usually more diverse) semantics found in corpus-based word embeddings. The purity score gives valuable insights into the completeness of a knowledge base, but also into the expected quality of complex semantic queries relying on reasoning over relationships, as for example analogy queries.
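The abstract does not give the purity formula. As a hedged illustration only, one plausible reading is to measure how consistently a KB relation behaves in embedding space, e.g. via the agreement of the embedding offset vectors of all entity pairs connected by that relation. All names and the scoring function below are hypothetical, not the paper's definition.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def purity(pairs, embeddings):
    """Hypothetical purity proxy: mean pairwise cosine similarity of the
    embedding offset vectors (head - tail) over all entity pairs sharing
    the same KB relation. A high score means the explicit relation behaves
    consistently in the corpus-based embedding space."""
    offsets = [
        [h - t for h, t in zip(embeddings[a], embeddings[b])]
        for a, b in pairs
    ]
    sims = [
        cosine(offsets[i], offsets[j])
        for i in range(len(offsets))
        for j in range(i + 1, len(offsets))
    ]
    return sum(sims) / len(sims)

# Toy embeddings where "capital-of" is a perfectly consistent offset.
emb = {"paris": [2.0, 1.0], "france": [1.0, 1.0],
       "rome": [3.0, 2.0], "italy": [2.0, 2.0]}
pairs = [("paris", "france"), ("rome", "italy")]
score = purity(pairs, emb)  # 1.0: the relation is perfectly "pure" here
```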
Prompting as Probing: Using Language Models for Knowledge Base Construction
Language Models (LMs) have proven to be useful in various downstream
applications, such as summarisation, translation, question answering and text
classification. LMs are becoming increasingly important tools in Artificial
Intelligence, because of the vast quantity of information they can store. In
this work, we present ProP (Prompting as Probing), which utilizes GPT-3, a
large Language Model originally proposed by OpenAI in 2020, to perform the task
of Knowledge Base Construction (KBC). ProP implements a multi-step approach
that combines a variety of prompting techniques to achieve this. Our results
show that manual prompt curation is essential, that the LM must be encouraged
to give answer sets of variable lengths, in particular including empty answer
sets, that true/false questions are a useful device to increase precision on
suggestions generated by the LM, that the size of the LM is a crucial factor,
and that a dictionary of entity aliases improves the LM score. Our evaluation
study indicates that these proposed techniques can substantially enhance the
quality of the final predictions: ProP won track 2 of the LM-KBC competition,
outperforming the baseline by 36.4 percentage points. Our implementation is
available at https://github.com/HEmile/iswc-challenge. Published in LM-KBC 22:
Knowledge Base Construction from Pre-trained Language Models, Challenge at ISWC 2022.
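The entity-alias step can be sketched as a normalisation pass over the LM's generated answers before scoring, so that different surface forms of the same entity are counted once. The dictionary and function names below are illustrative, not taken from the ProP codebase.

```python
def normalise(answers, alias_to_canonical):
    """Map each LM-generated surface form to a canonical entity name via
    an alias dictionary, deduplicating answers that refer to the same
    entity (e.g. 'NYC' and 'New York City')."""
    seen, out = set(), []
    for a in answers:
        canonical = alias_to_canonical.get(a.strip().lower(), a.strip())
        if canonical not in seen:
            seen.add(canonical)
            out.append(canonical)
    return out

# Hypothetical alias dictionary.
aliases = {"nyc": "New York City", "new york": "New York City"}
normalise(["NYC", "New York City", "Boston"], aliases)
# -> ["New York City", "Boston"]
```

Scoring against canonical names rather than raw generations is one way an alias dictionary can improve the LM score, as the abstract reports.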